Processing forms using artificial intelligence models

ABSTRACT

An application server may receive an input document including a set of input text fields and an input key phrase querying a value for a key-value pair that corresponds to one or more of the set of input text fields. The application server may extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. After extraction, the application server may input the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string corresponds to the value for the key-value pair. The application server may then identify the value for the key-value pair corresponding to the input key phrase and may output the identified value.

FIELD OF TECHNOLOGY

The present disclosure relates generally to database systems and data processing, and more specifically to processing forms using artificial intelligence models.

BACKGROUND

A cloud platform (i.e., a computing platform for cloud computing) may be employed by many users to store, manage, and process data using a shared network of remote servers. Users may develop applications on the cloud platform to handle the storage, management, and processing of data. In some cases, the cloud platform may utilize a multi-tenant database system. Users may access the cloud platform using various user devices (e.g., desktop computers, laptops, smartphones, tablets, or other computing systems, etc.).

In one example, the cloud platform may support customer relationship management (CRM) solutions. This may include support for sales, service, marketing, community, analytics, applications, and the Internet of Things. A user may utilize the cloud platform to help manage contacts of the user. For example, managing contacts of the user may include analyzing data, storing and preparing communications, and tracking opportunities and sales.

Systems may use or otherwise support fillable forms having fields for input data and a variety of formats (e.g., order forms, invoices, etc.). A user may use the cloud platform to query for and extract meaningful information from a fillable form. In some systems, the form may have a specific template and the user may be limited to querying using specific terms or query formats. However, in cases with no predefined templates for reference, it is challenging to automatically extract information of interest from forms. Thus, techniques for extracting information from forms having different formats may be improved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of form processing at a server system that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 2 illustrates an example of a computing system that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 3 illustrates an example of a process flow that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 4 illustrates an example of an input document that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 5 illustrates an example of a process flow that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 6 shows a block diagram of an apparatus that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 7 shows a block diagram of a processing component that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIG. 8 shows a diagram of a system including a device that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

FIGS. 9 through 11 show flowcharts illustrating methods that support processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

An organization may store information and data for users (e.g., customers, organizations, etc.) such as data and metadata for exchanges, opportunities, orders, invoices, deals, assets, customer information, and the like. Some data storage and processing systems may receive or store data using forms with fillable fields, and some systems may support a variety of formats of fillable forms (e.g., order forms, invoices, etc.). Forms can be classified into at least two categories in terms of their layout flexibility: fixed forms and non-fixed forms. A fixed form may be defined as a form that has limited structure variations in terms of layouts, texts, and visual appearance. For example, a driver's license may be considered a fixed form, since each state has a limited number of designs and formats for its driver's license. A non-fixed form may be defined as a form that has a non-fixed structure or is otherwise flexible in terms of its layout and content. For example, an invoice may be considered a non-fixed form as each vendor may have their own design of invoices with different layouts, texts, and visual appearances. Systems may be configured to automatically ingest forms and extract information from them such that the information can be queried, stored, or otherwise processed. However, such systems may require a template or other reference to guide the system in understanding which fields correspond to which types of information. Such systems are thereby limited in their utility if there are no predefined templates for reference or if the system is tasked with ingesting non-fixed forms with varying formats. Therefore, it is challenging to automatically extract information of interest from non-fixed forms using current systems.

Techniques of the present disclosure provide for an automatic system to extract information of interest from fixed forms and non-fixed forms, thus improving the processing efficiency of documents having different formats. The techniques described herein provide for a method of extracting key-value pairs from arbitrary non-fixed forms based on specified requests (e.g., queries) by users. The system, which may include a database system, one or more application servers, a cloud platform, or any combination of computing devices and architectures as described herein, may use an artificial intelligence model (e.g., a machine learned model) applicable to arbitrary types of fixed forms and non-fixed forms. Users may use the artificial intelligence model to extract key-value pairs of interest or infer the value for arbitrary keys specified by users. To retrieve a value corresponding to an input phrase, the system may use an artificial intelligence model in conjunction with an image text extractor (e.g., an optical character recognition model).

The system may receive a user input including an input document (such as a form, a set of forms, etc.) and an input key phrase (e.g., a query). The input document may include a set of input text fields. Upon receiving the input form, the system may extract, using an optical character recognition model (or similar image or text processing model), a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. For instance, the system may process the input form (e.g., input as an image or text file) to detect and recognize the words [w1, w2, . . . , wM] and their locations [b1, b2, . . . , bM] in the image. The system may then input the words (in the form of character strings or groups of characters), their corresponding locations, and the input key phrase into an artificial intelligence model (e.g., a transformer-based model). The artificial intelligence model may be an example of a machine learned model that is trained to compute, for each character string of the set of character strings, a probability that the character string corresponds to the value for the key-value pair corresponding to the input key phrase. In some examples, the system may then generate a probability of each word being a value corresponding to the requested key. In examples where the value includes multiple words, the system may group potential value words into phrases based on the output of the artificial intelligence model and the spatial arrangement of the words. The system may identify the value for the key-value pair corresponding to the input key phrase based on inputting the extracted set of character strings and the set of two-dimensional locations into the machine learned model. In some examples, the system may output the matching value phrase for the input key phrase.
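As a non-limiting illustration, the grouping of potential value words into phrases may be sketched as follows. The sketch assumes that each word already carries a model probability and a simple bounding box; the `Word` type, the `group_value_words` function, the threshold, and the pixel tolerances are hypothetical names and values chosen for illustration and are not specified by the disclosure. Adjacent high-probability words on the same text line are merged into one phrase.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float      # left edge of the word on the page (hypothetical units)
    y: float      # vertical position of the text line
    width: float  # horizontal extent of the word
    prob: float   # model probability that the word is part of the value


def group_value_words(words, threshold=0.5, max_gap=15.0, line_tol=5.0):
    """Merge neighbouring high-probability words into value phrases.

    Simplifying assumption: words on one text line share (nearly) the same y,
    so sorting by (y, x) yields reading order.
    """
    candidates = sorted(
        (w for w in words if w.prob >= threshold),
        key=lambda w: (w.y, w.x),
    )
    phrases = []
    for w in candidates:
        if phrases:
            last = phrases[-1][-1]
            same_line = abs(w.y - last.y) <= line_tol
            gap = w.x - (last.x + last.width)
            if same_line and gap <= max_gap:
                phrases[-1].append(w)  # continue the current phrase
                continue
        phrases.append([w])  # start a new phrase
    return [" ".join(w.text for w in phrase) for phrase in phrases]


# Illustrative data only: probabilities are made up for the key "Invoice Date".
words = [
    Word("Invoice", 10, 20, 50, 0.02),
    Word("Date:", 65, 20, 35, 0.05),
    Word("March", 110, 20, 40, 0.91),
    Word("3,", 155, 20, 12, 0.88),
    Word("2021", 172, 20, 30, 0.93),
    Word("Total", 10, 60, 35, 0.01),
]
phrases = group_value_words(words)
```

In this sketch the three date words exceed the threshold and sit within the gap tolerance of one another, so they form a single phrase; other grouping strategies may equally be used.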

Aspects of the disclosure are initially described in the context of an environment supporting an on-demand database service. Aspects of the disclosure are further described with respect to a general system diagram that shows computing components and data flows that support processing forms using artificial intelligence models, a block diagram illustrating a user interface, and a process flow diagram illustrating various process and dataflows that support the techniques herein. Aspects of the disclosure are further illustrated by and described with reference to apparatus diagrams, system diagrams, and flowcharts that relate to processing forms using artificial intelligence models.

FIG. 1 illustrates an example of a system 100 for cloud computing that supports processing forms using artificial intelligence models in accordance with various aspects of the present disclosure. The system 100 includes cloud clients 105, contacts 110, cloud platform 115, and data center 120. Cloud platform 115 may be an example of a public or private cloud network. A cloud client 105 may access cloud platform 115 over network connection 135. The network may implement transmission control protocol and internet protocol (TCP/IP), such as the Internet, or may implement other network protocols. A cloud client 105 may be an example of a user device, such as a server (e.g., cloud client 105-a), a smartphone (e.g., cloud client 105-b), or a laptop (e.g., cloud client 105-c). In other examples, a cloud client 105 may be a desktop computer, a tablet, a sensor, or another computing device or system capable of generating, analyzing, transmitting, or receiving communications. In some examples, a cloud client 105 may be operated by a user that is part of a business, an enterprise, a non-profit, a startup, or any other organization type.

A cloud client 105 may interact with multiple contacts 110. The interactions 130 may include communications, opportunities, purchases, sales, or any other interaction between a cloud client 105 and a contact 110. Data may be associated with the interactions 130. A cloud client 105 may access cloud platform 115 to store, manage, and process the data associated with the interactions 130. In some cases, the cloud client 105 may have an associated security or permission level. A cloud client 105 may have access to some applications, data, and database information within cloud platform 115 based on the associated security or permission level, and may not have access to others.

Contacts 110 may interact with the cloud client 105 in person or via phone, email, web, text messages, mail, or any other appropriate form of interaction (e.g., interactions 130-a, 130-b, 130-c, and 130-d). The interaction 130 may be a business-to-business (B2B) interaction or a business-to-consumer (B2C) interaction. A contact 110 may also be referred to as a customer, a potential customer, a lead, a client, or some other suitable terminology. In some cases, the contact 110 may be an example of a user device, such as a server (e.g., contact 110-a), a laptop (e.g., contact 110-b), a smartphone (e.g., contact 110-c), or a sensor (e.g., contact 110-d). In other cases, the contact 110 may be another computing system. In some cases, the contact 110 may be operated by a user or group of users. The user or group of users may be associated with a business, a manufacturer, or any other appropriate organization.

Cloud platform 115 may offer an on-demand database service to the cloud client 105. In some cases, cloud platform 115 may be an example of a multi-tenant database system. In this case, cloud platform 115 may serve multiple cloud clients 105 with a single instance of software. However, other types of systems may be implemented, including, but not limited to, client-server systems, mobile device systems, and mobile network systems. In some cases, cloud platform 115 may support CRM solutions. This may include support for sales, service, marketing, community, analytics, applications, and the Internet of Things. Cloud platform 115 may receive data associated with contact interactions 130 from the cloud client 105 over network connection 135, and may store and analyze the data. In some cases, cloud platform 115 may receive data directly from an interaction 130 between a contact 110 and the cloud client 105. In some cases, the cloud client 105 may develop applications to run on cloud platform 115. Cloud platform 115 may be implemented using remote servers. In some cases, the remote servers may be located at one or more data centers 120.

Data center 120 may include multiple servers. The multiple servers may be used for data storage, management, and processing. Data center 120 may receive data from cloud platform 115 via connection 140, or directly from the cloud client 105 or an interaction 130 between a contact 110 and the cloud client 105. Data center 120 may utilize multiple redundancies for security purposes. In some cases, the data stored at data center 120 may be backed up by copies of the data at a different data center (not pictured).

Subsystem 125 may include cloud clients 105, cloud platform 115, and data center 120. In some cases, data processing may occur at any of the components of subsystem 125, or at a combination of these components. In some cases, servers may perform the data processing. The servers may be a cloud client 105 or located at data center 120.

The data center 120 may be an example of a multi-tenant system that supports data storage, retrieval, data analytics, and the like for various tenants, such as the cloud clients 105. As such, each cloud client 105 may be provided with a database instance in the data center 120, and each database instance may store various datasets that are associated with the particular cloud client 105. More particularly, each cloud client 105 may have a specific set of datasets that are unique for the cloud client 105. The cloud platform 115 and data center 120 may support a system that processes a set of datasets for a particular cloud client 105. In some examples, the cloud platform 115 and data center 120 support a system that receives an input document and an input key phrase from a particular cloud client 105 and generates a value for the input key phrase based on a machine learned model. In some examples, the input key phrase may be received as a natural language query. The value corresponding to the input key phrase is based on a set of character strings (words or phrases) and a set of two-dimensional locations of the set of character strings on a layout of the input document. That is, the value determination in response to inputting a form may support customer-specific analytics by capturing contexts or meanings that are unique to a form type and the cloud client 105.

Forms are common in daily business workflows. A large amount of human effort is needed to process the massive number of form-like documents. Developing an automatic system to extract information of interest from forms may improve the processing efficiency. As described above, a fixed form may be defined as a form that has limited structure variations in terms of layouts, texts, and visual appearance. For example, a driver's license may be considered a fixed form, since each state has one design (or a limited number of designs) of its driver's license. Because the structure of these forms is fixed, predefined templates may be used to extract information of interest. On the other hand, a non-fixed form may be defined as a form that has non-fixed structures. For example, an invoice may be considered a non-fixed form given the fact that each vendor will have their own design of invoices with different layouts, texts, and visual appearances. Since there are no predefined templates for reference, it may be challenging to extract information of interest from non-fixed forms.

Some techniques may be utilized to extract information in the form of key-value pairs from non-fixed forms. In some examples, a system may extract all the key-value pairs from the form without considering the interest of users. After that, the users may manually select information from the redundant results. For instance, a system may receive a form as an input, and may identify a mapping between keys and values included in the form. However, there is no way for a user to query a value for a particular key included in the form. In this case, the user may receive a one-to-one mapping of keys and values, and may have to manually sort through the mapping in order to identify the requested key and determine its corresponding value. In another example, a system may extract key-value pairs of predefined field categories on invoices. However, such techniques for information extraction may work only for a predefined set of fields, may be specially designed for invoices, and may not be usable for other types of non-fixed forms. In some examples, a system may extract values for customized keys. However, such an extraction technique may depend on additional information from users (such as the specific key's data type), may not be able to handle virtual keys, and may be limited in handling key variations. In some examples, virtual keys may be defined as keys that are not included in a document, and key variation may occur when an input key is not an exact match with any key included in a document. Thus, the system configured to extract values for customized keys from a document may not be able to handle an input key that is not included in the document.

As described herein, the data center 120 and cloud platform 115 may support processing forms using artificial intelligence models using a key text as the input and can handle virtual keys and key variations. For instance, the data center 120 and cloud platform 115 may support receiving a query for an input key associated with a document, where the input key is not included in the document. Additionally or alternatively, the data center 120 and cloud platform 115 may support receiving a query for an input key associated with a document, where the input key is not an exact match with any key included in the document. In some examples, a system may receive an input document including a set of input text fields and an input key phrase querying a value for a key-value pair that corresponds to one or more of the set of input text fields. The system may use a machine learned model to determine a value for a key-value pair corresponding to the input key phrase. In some examples, the system may input the input key phrase into the machine learned model. The system may then identify a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase. The system may then output the identified value corresponding to the input key phrase.

It should be appreciated by a person skilled in the art that one or more aspects of the disclosure may be implemented in a system 100 to additionally or alternatively solve other problems than those described herein. Furthermore, aspects of the disclosure may provide technical improvements to “conventional” systems or processes as described herein. However, the description and appended drawings only include example technical improvements resulting from implementing aspects of the disclosure, and accordingly do not represent all of the technical improvements provided within the scope of the claims.

FIG. 2 illustrates an example of a computing system 200 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The computing system 200 includes a user device 205 and a server 210. The user device 205 may be an example of a device associated with a cloud client 105 or contact 110 of FIG. 1. The server 210 may be an example of aspects of the cloud platform 115 and the data center 120 of FIG. 1. For example, the server 210 may represent various devices and components (e.g., application servers, databases, cloud storage, etc.) that support an analytical data system as described herein. The server 210 may support a multi-tenant database system, which may manage various datasets 225 that are associated with specific tenants (e.g., cloud clients 105). In some examples, the datasets 225 may include a set of forms related to the tenant. In some examples, the server 210 may be configured to support a single organization or tenant instead of being configured as a multi-tenant system. The server 210 may also support data retrieval in response to input 215 (e.g., queries) received from user devices, such as user device 205. For example, the server 210 may support retrieving a value from a form based on receiving an input key. The data (e.g., the value corresponding to an input key) retrieved in response to an input 215 may be surfaced to a user at the user device 205.

As described, the server 210 may manage various datasets 225 including forms having different formats. The datasets 225 may be associated with specific tenants in the example of a multi-tenant system. For example, a datastore may store a set of datasets 225 that are associated with the tenant corresponding to user device 205. A dataset of the set of datasets 225 may include a fillable form or multiple forms. As depicted herein, the computing system 200 may support a variety of formats of fillable forms including fixed forms and non-fixed forms (e.g., order forms, invoices, etc.). Some computing systems may not be able to automatically extract information from forms having different formats. To support automatic extraction of information from forms, a data preprocessor 230 may identify fields from the forms in a dataset 225. The datasets 225 may store training data including an indication of one or more fields of a form (e.g., a key) corresponding to a related field (e.g., a value) according to relationships between the fields. The training data may be forwarded to the training function 245. According to one or more aspects, the training function 245 may utilize a set of forms (having different formats) to train a machine learned model. In some examples, the training function 245 may receive a set of training forms (e.g., input documents) from the dataset 225 and may extract a set of key-value pairs from the set of training forms or input files (stored in dataset 225). For instance, the training function 245 may train a model to identify a value corresponding to a key in an input document. The training function 245 may utilize labeled data in the set of training forms to identify a value for a corresponding key.

The training function 245 may train the machine learned model 235 (or some other machine learned model) based on inputting a set of input file formats into a transformer-based model. The transformer-based model is described in further detail with reference to FIG. 3. For each word $w_i$ in the set of training forms or input files, an annotation $l_{w_i} \in \{0, 1\}$ may indicate whether the word is a part of the value phrase corresponding to an input key phrase. During the training operation, the training function 245 may train the machine learned model 235 by calculating a binary cross entropy loss between a predicted probability and a ground-truth label using the following equation:

$\text{loss} = -\sum_{i=1}^{M} \left( l_{w_i} \log p\left(\text{value} \mid w_i, \text{key-phrase}, \text{document}\right) + \left(1 - l_{w_i}\right) \log\left(1 - p\left(\text{value} \mid w_i, \text{key-phrase}, \text{document}\right)\right) \right)$

In the equation, $l_{w_i}$ is defined as an annotation for a word $w_i$, M is the number of words, and key-phrase is a key for which the value is calculated. The training function 245 may send the trained model to machine learned model 235. Accordingly, the machine learned model 235 may be trained to identify a value for a particular key included in an input document.
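As a non-limiting illustration, the loss may be computed as sketched below. The sketch assumes the standard form of binary cross entropy (the negated sum of the log-likelihood terms over the M words); the function name `bce_loss`, the clamping constant `eps`, and the sample numbers are hypothetical and chosen for illustration only.

```python
import math


def bce_loss(labels, probs, eps=1e-12):
    """Binary cross entropy summed over the words of one document.

    labels[i] is the annotation for word i (1 if the word is part of the
    value phrase for the queried key, else 0); probs[i] is the predicted
    probability that word i is part of the value phrase.
    """
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    total = 0.0
    for label, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return -total


# Illustrative numbers only: two words, the first annotated as a value word.
labels = [1, 0]
confident_loss = bce_loss(labels, [0.9, 0.1])  # predictions agree with labels
wrong_loss = bce_loss(labels, [0.1, 0.9])      # predictions disagree
```

Predictions that agree with the annotations yield a smaller loss than predictions that disagree, which is the property a training function may rely on when updating model parameters.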

According to one or more aspects of the present disclosure, the forms may be associated with at least one of reports, report types, data objects, data sets, or a combination thereof. In some examples, the computing system 200 may support analytics to extract meaningful information from different types of forms. According to aspects described herein, the data preprocessor 230 may receive one or more inputs 215 (e.g., queries such as natural language queries or database queries). The one or more inputs 215 may also include an input document including a set of input text fields. A user using the user device 205 may upload an input form including a set of input text fields. The input form may be of a fixed format or a non-fixed format. In some instances, each input document may include a set of fields. The data preprocessor 230 may receive the input 215 and may convert the input document into an input understandable by the data preprocessor 230.

In some examples, the data preprocessor 230 may receive an input key phrase 215-a in addition to the input document. The input key phrase 215-a may query a value for a key-value pair that corresponds to one or more of the set of input text fields (e.g., the set of input text fields included in the input document received at the data preprocessor 230). In some examples, the data preprocessor 230 may extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The data preprocessor 230 may then input the extracted set of character strings and the set of two-dimensional locations into a machine learned model (at the machine learned model 235) that is trained (using the training function 245) to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase 215-a. As depicted herein, prior to receiving the input 215, the training function 245 may train a machine learned model based on inputting a set of input file formats into the machine learned model. For instance, the training function 245 may receive a set of training forms including labeled data identifying keys and corresponding values for the keys. Based on the labeled data, the training function 245 may train the machine learned model 235 to identify a value corresponding to a key in an arbitrary document. As one method of identification, the training function 245 may train the machine learned model 235 to compute a probability of each of multiple potential words or phrases being a value corresponding to a key. The training function 245 may further train the machine learned model 235 to rank the computed probabilities to identify a value for a requested key.

In some examples, the data preprocessor 230 may input the input key phrase 215-a into the machine learned model. The machine learned model 235 (as trained by the training function 245) may identify a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase 215-a. In some instances, the data postprocessor 240 may rank the set of probabilities based on a value of each probability in the set of probabilities. The data postprocessor 240 may then identify the value for the key-value pair corresponding to the input key phrase 215-a. In some examples, identifying the value for the key-value pair corresponding to the input key phrase 215-a may be based on ranking the set of probabilities. For instance, the data postprocessor 240 may identify the value corresponding to the input key phrase 215-a as the value having a highest computed probability. For example, the machine learned model 235 may rank probabilities of a value for the key-value pair corresponding to the input key phrase 215-a and the data postprocessor 240 may identify a result (e.g., having a top ranked probability). Upon identifying the value corresponding to the requested input key phrase 215-a, the data postprocessor 240 may transmit the identified value (in results 220) to the user device 205. As such, the result 220 including the identified value corresponding to the input key phrase 215-a may be returned to the user. The concepts and techniques described with reference to FIG. 2 are further described with respect to the following figures.
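As a non-limiting illustration, the ranking step may be sketched as follows. The sketch assumes that a probability is already available for each extracted character string; the function name `top_ranked_value` and the probability values are hypothetical and are not outputs of any actual model.

```python
def top_ranked_value(strings, probs):
    """Rank character strings by probability and return the best one.

    Returns a tuple (best_string, ranked), where ranked is a list of
    (string, probability) pairs sorted from most to least probable.
    """
    if not strings or len(strings) != len(probs):
        raise ValueError("need one probability per character string")
    ranked = sorted(zip(strings, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[0][0], ranked


# Illustrative probabilities for a query on key "K2" (made-up numbers).
strings = ["K1", "K2", "K3", "V1", "V2", "V3"]
probs = [0.01, 0.03, 0.01, 0.10, 0.92, 0.07]
value, ranked = top_ranked_value(strings, probs)
```

Here the string with the top ranked probability is returned as the identified value, and the full ranking remains available should a postprocessor need lower ranked candidates.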

FIG. 3 illustrates an example of a process flow 300 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The process flow 300 may be implemented in conjunction with a user device and a server (e.g., an application server or a combination of computing devices as described herein). The user device may be an example of the user device 205 as described with respect to FIG. 2, and the server may be an example of the server 210 as described with respect to FIG. 2. Although one user device is depicted in the example of FIG. 3, it may be understood that the process flow 300 may be implemented using multiple user devices. The server may represent a set of computing components, data storage components, and the like, that support a database system as described herein. In some examples, the database system may be configured as a multi-tenant database system as described herein.

In some examples, the operations illustrated in the process flow 300 may be performed by hardware (e.g., including circuitry, processing blocks, logic components, and other components), code (e.g., software or firmware) executed by a processor, or any combination thereof. Alternative examples of the following may be implemented, where some steps are performed in a different order than described or are not performed at all. In some cases, steps may include additional features not mentioned below, or further steps may be added.

The aspects depicted herein provide a method for extracting key-value pairs from arbitrary non-fixed forms based on requests from users. The process flow 300 may be applicable to arbitrary types of non-fixed forms, and users may flexibly extract key-value pairs of interest. As described with reference to FIG. 3, the process flow 300 utilizes a transformer-based machine learned model (e.g., machine learned model 235) to infer the value for arbitrary keys specified by users. In some examples, the transformer-based machine learned model may be a deep learning model that adopts the mechanism of differentially weighting the significance of different portions of input data. Transformer-based machine learned models may be used in the field of natural language processing. While the process flow 300 is described with reference to a transformer-based machine learned model, it is to be understood that other machine learned models may be used for extracting key-value pairs from arbitrary non-fixed forms based on requests from users.

At 305, a user may provide an input document including a set of input text fields. The input document may be stored in a database, server, cloud storage, or any other form of data storage as described with reference to FIGS. 1 and 2. The input document may be an example of a fixed form or a non-fixed form. The form may include a set of text fields and a set of values corresponding to the text fields. The text fields and values may be considered as key-value pairs such that a field may be considered a key and the text, numbers, or data in that field may be considered the value corresponding to the key. A specific example of a form with key-value pairs is provided with reference to FIG. 4.

At 310, the user may input an input key phrase querying a value for a key-value pair that corresponds to one or more of the set of input text fields. For example, the user may upload a form and may query a value corresponding to a key included in the form. In some instances, a user may upload a form F including keys K1, K2, and K3. Each key in the form may have a value associated with it. In one example, the form may include value V1 corresponding to key K1, value V2 corresponding to key K2, and value V3 corresponding to key K3. The user may query the value for input key phrase K2. In some examples, the form may have already been input or uploaded (e.g., previously, or by another user, etc.), and the user at 310 may provide the input key phrase for a querying operation.

At 315, an optical character recognition model may extract from the input document a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. With reference to the prior example form, the optical character recognition model may extract the words K1, K2, K3, V1, V2, and V3. Although each value is depicted as a single word, it is to be understood that a single value phrase may include multiple words. In the case where a single value phrase includes multiple words, the optical character recognition model may identify each word separately. The optical character recognition model may process the input document (e.g., an image of a form) to detect and recognize optical character recognition words [w1, w2, . . . , wM] and their corresponding locations [b1, b2, . . . , bM] in the input document. In some examples, the locations may be x-y coordinates in the input document. Referring to form F, the optical character recognition model may detect and recognize optical character recognition words [K1, K2, K3, V1, V2, V3] and their corresponding locations [x1y1, x2y2, x3y3, x4y4, x5y5, x6y6]. Although an optical character recognition model is provided as an example, it is to be understood that any model capable of extracting text, values, or information from an image or text file may be used at 315.

At 320, the set of character strings and the set of two-dimensional locations of the set of character strings may be input into a machine learned model such as a transformer-based model. As described herein, the transformer-based model may be an example of a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. That is, the machine learned model may compute a probability for each character string, where the probability indicates how likely that character string is the value that corresponds to the key in the form that corresponds to the key phrase input by the user.

The transformer-based model may receive the words generated by the optical character recognition model along with their corresponding locations and the requested key phrase as inputs and may generate the probability of each word being the value corresponding to the requested key or key phrase. In some examples, a server or other computing device executing the transformer-based model may tokenize the input key phrase into words, [kw1, kw2, . . . , kwN], where k may be a constant. In some examples, [kw1, kw2, . . . , kwN] and [w1+b1, w2+b2, . . . , wM+bM] may be inputted to the transformer-based model. As depicted herein, w1 included in the optical character recognition words [w1, w2, . . . , wM] and b1 included in their corresponding locations [b1, b2, . . . , bM] correspond to a word generated with the optical character recognition model and its location in the input document.

At 325, a server or other computing device may use the transformer-based model to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. In some examples, the input key phrase querying a value for a key-value pair may not be associated with location information. In such cases, where the input key phrase has no location information, the transformer-based model may use a dummy location [0,0,0,0] for each key word to fit the transformer's input parameter. The transformer-based model may first generate a feature representation for each input key word and each optical character recognized word. In some examples, the feature representation may use a set of techniques that allows a system to perform feature detection or classification from raw data. The feature representation may include an array with a certain length (N) that represents the words (each input key word and each optical character recognized word) in a way that the transformer-based model can process. In the example of form F, the feature representations that the transformer-based model generates for the input key word and the optical character recognition words [K1, K2, K3, V1, V2, V3] may be [fkw1′, fkw2′, fkw3′, fkw4′, fkw5′, fkw6′].
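The input assembly described above (key words paired with a dummy location, followed by optical character recognition words paired with their locations) can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the function name `build_model_input`, whitespace tokenization of the key phrase, and the four-number box layout are assumptions made for the example.

```python
# Hypothetical sketch: assemble (token, location) pairs for a layout-aware model.
# Assumption: a location is a 4-tuple, matching the dummy location [0,0,0,0].
DUMMY_BOX = (0, 0, 0, 0)


def build_model_input(key_phrase, ocr_words, ocr_boxes):
    """Return the combined token sequence and the number of key words."""
    if len(ocr_words) != len(ocr_boxes):
        raise ValueError("each OCR word needs exactly one location")
    # Assumption: the key phrase is tokenized on whitespace.
    key_words = key_phrase.split()
    # Key words carry no layout information, so each gets the dummy location.
    tokens = [(kw, DUMMY_BOX) for kw in key_words]
    # OCR words keep the locations reported by the OCR model.
    tokens += list(zip(ocr_words, ocr_boxes))
    return tokens, len(key_words)
```

Returning the key-word count lets a caller separate key-word features from optical character recognition word features after the model has processed the combined sequence.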

In some examples, a server or other computing device may generate a first set of feature representations for a set of keywords included in the set of input text fields and a second set of feature representations for the extracted set of character strings. For example, the server may generate the optical character recognized word's feature representation as [fkw1′, fkw2′, . . . , fkwN′]. The optical character recognized word's representation [fkw1′, fkw2′, . . . , fkwN′] may further be projected by a fully connected layer leading to [fw1, fw2, . . . , fwM], where M is a total number of optical character recognition words in the input document (e.g., form F). For instance, the transformer-based model may pass the feature representation array [fkw1′, fkw2′, . . . , fkwN′] through a fully connected layer, generating an array [fw1, fw2, . . . , fwM].

In some examples, the server may generate a unified feature representation for the input key phrase based on the first set of feature representations and the second set of feature representations. In some instances, the server may generate a unified representation f_{key-phrase} for the input key phrase by first averaging the features of all the key words and then projecting the averaged representation to another space by a fully connected layer. In some examples, the fully connected layer may be a normalization layer that receives an array as an input and provides a normalized term as an output. In the example depicted herein, the fully connected layer may receive an array including a normalized representation of the keywords and may generate a unified term (or representation) for the keywords in the input document. In some instances, the fully connected layers may be examples of encoder layers of the transformer-based model.
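The averaging and projection steps for the unified key phrase representation can be illustrated with a small sketch. The helper names and the hand-written weight matrix and bias are hypothetical; in practice the fully connected layer's parameters would be learned during training, and the feature vectors would come from the transformer-based model.

```python
# Hypothetical sketch: average key-word features, then apply one fully
# connected (affine) layer. Weights here are placeholders, not trained values.

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


def fully_connected(x, weight, bias):
    """Compute weight @ x + bias, where weight has one row per output dimension."""
    return [
        sum(w_j * x_j for w_j, x_j in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def unified_key_phrase_representation(key_word_features, weight, bias):
    """Average the key-word features and project the result to another space."""
    return fully_connected(mean_vector(key_word_features), weight, bias)
```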

The transformer-based model, as part of an inference procedure, may then determine a probability of each extracted word being the value for the input key phrase. For example, the transformer-based model may identify a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase. In some examples, the transformer-based model may apply a dot product between the unified feature representation for the input key phrase and each feature representation of the second set of feature representations. In some examples, the probability that the character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase is computed based on applying the dot product.

In some examples, the transformer-based model may obtain a matching score between the input key phrase and each of the extracted words. The matching score may be obtained by applying a dot product between the representation of the key phrase and each extracted word's feature representation. The transformer-based model may obtain the matching probability by applying the sigmoid function to the matching score using the following equation:

$$p(\text{value} \mid w_i, \text{key-phrase}, \text{document}) = \operatorname{sigmoid}\left(f_{\text{key-phrase}} \cdot f_{w_i}\right)$$

In the equation, the probability that a word w_i is a value corresponding to the input key phrase in the document is given by applying a sigmoid function to the dot product between the unified representation f_{key-phrase} and the feature representation of the word w_i (shown as f_{w_i}).
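As a worked illustration of the equation, the following sketch computes a matching probability for each extracted word from a key phrase representation and per-word feature vectors. The function names are hypothetical and the vectors in any call are made-up inputs; only the sigmoid-of-dot-product computation is taken from the description.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def matching_probabilities(f_key_phrase, word_features):
    """Return sigmoid(f_key_phrase . f_w) for each word feature vector f_w."""
    return [sigmoid(dot(f_key_phrase, f_w)) for f_w in word_features]
```

Because each probability is computed independently with a sigmoid, the probabilities across words need not sum to one, which is consistent with a value phrase spanning several words.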

In some examples, the transformer-based model may identify a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase. In the example of form F, the transformer-based model may determine that the value V1 has a probability p1 of being the value corresponding to the input key phrase K2. Similarly, the transformer-based model may determine that the value V2 has a probability p2 of being the value corresponding to the input key phrase K2 and that the value V3 has a probability p3 of being the value corresponding to the input key phrase K2. The transformer-based model may rank the set of probabilities based on a value of each probability in the set of probabilities. For example, the set of probabilities may be ranked in decreasing order, such that character strings having the highest probability of being a match are listed first, followed by character strings having lower probabilities. In the example of form F, the transformer-based model may rank the probabilities as p2, p1, and p3. Such a ranking may indicate that the value V2 has the highest probability of being a value of the input key phrase K2.
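The ranking step can be sketched in a few lines. The function name `rank_candidates` is hypothetical, and any probabilities passed to it in a call are illustrative numbers chosen for the example rather than model outputs.

```python
def rank_candidates(words, probabilities):
    """Pair each word with its probability and sort in decreasing probability."""
    return sorted(
        zip(words, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

For form F, calling `rank_candidates(["V1", "V2", "V3"], [0.30, 0.90, 0.05])` would list V2 first, then V1, then V3, mirroring the ordering p2, p1, p3 described above.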

In some examples, the transformer-based model may determine that the input key phrase does not match a key that corresponds to one or more of the set of input text fields. In the example of form F, the input key phrase may include a key K5 which does not match the keys K1, K2, and K3 included in the document. In such cases, the transformer-based model may generate or otherwise identify a dummy key corresponding to the input key phrase based on determining that the input key phrase does not match the key. In some examples, the transformer-based model may identify an approximate match between the input key phrase (K5) and the keys (K1, K2, and K3) included in the document. For instance, the transformer-based model may match K5 with K2. In such cases, the transformer-based model may return V2 as the value for the input key phrase K5. In some examples, identifying the value for the key-value pair is based on identifying the dummy key. The transformer-based model may also determine that the input key phrase is associated with an empty value field. In some examples, the server may identify a dummy value corresponding to the input key phrase based on determining that the input key phrase is associated with the empty value field. In some cases, the identified value corresponding to the input key phrase may include the dummy value.

At 335, the server may perform a post processing operation to identify the value for the key-value pair corresponding to the input key phrase based on inputting the extracted set of character strings and the set of two-dimensional locations into the transformer-based model. For example, the server may identify the value for the key-value pair corresponding to the input key phrase in accordance with the probabilities ranked by the transformer-based model. In some instances, identifying the value for the key-value pair corresponding to the input key phrase is based on ranking the set of probabilities.

In some examples, the server may receive the probabilities of each value and may generate a value phrase proposal. Values from the input document may contain multiple words. For example, the value V2 in form F may include words v21, v22, and v23 (e.g., the value for the key K2 may include several words and/or numbers), and as the post processing, the server may generate proposals by grouping nearby extracted words if their horizontal and/or vertical distance is within some threshold. For instance, if the words v21, v22, and v23 are within a threshold vicinity of each other, then the server groups them as a single value phrase. Additionally or alternatively, the server may generate a probability of each proposal being included in the value for the input key phrase. For example, the server may determine the probability of each proposal as the maximum of the extracted words' probabilities within that group. In the example of form F, the server may determine a combined probability of words v21, v22, and v23. If the probability of words v21, v22, and v23 is higher than the probability of the remaining groups, the server may then generate the value for the input key phrase (e.g., pick the proposal with the highest probability of being the value). If the probability is lower than a threshold, the server may refrain from responding. Thus, the server may group one or more character strings into a value phrase based on an output of the machine learned model and the set of two-dimensional locations of the set of character strings. As depicted herein, identifying the value for the key-value pair may be based on grouping the one or more character strings.
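The post processing described above (grouping nearby words into proposals, scoring each proposal by the maximum word probability in its group, and declining to answer below a threshold) can be sketched as follows. The box layout (x0, y0, x1, y1), the gap-based notion of distance, the default thresholds, and all function names are assumptions made for the example; the description does not fix these details. Words within a group are joined in their original index order, which assumes the optical character recognition output is already in reading order.

```python
def axis_gap(lo_a, hi_a, lo_b, hi_b):
    """Gap between two intervals on one axis; 0 if they overlap."""
    return max(0, max(lo_a, lo_b) - min(hi_a, hi_b))


def are_near(box_a, box_b, max_dx, max_dy):
    """True if both the horizontal and vertical gaps are within the thresholds."""
    dx = axis_gap(box_a[0], box_a[2], box_b[0], box_b[2])
    dy = axis_gap(box_a[1], box_a[3], box_b[1], box_b[3])
    return dx <= max_dx and dy <= max_dy


def group_words(boxes, max_dx, max_dy):
    """Group word indices into connected components of mutually near boxes."""
    groups = []
    unvisited = set(range(len(boxes)))
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            neighbors = [
                i for i in unvisited
                if are_near(boxes[current], boxes[i], max_dx, max_dy)
            ]
            for i in neighbors:
                unvisited.remove(i)
            group.extend(neighbors)
            frontier.extend(neighbors)
        groups.append(sorted(group))
    return groups


def pick_value_phrase(words, boxes, probs, max_dx=10, max_dy=5, min_prob=0.5):
    """Return (phrase, probability) for the best proposal, or None if too weak."""
    best_phrase, best_prob = None, 0.0
    for group in group_words(boxes, max_dx, max_dy):
        # A proposal's probability is the maximum probability of its words.
        group_prob = max(probs[i] for i in group)
        if group_prob > best_prob:
            best_phrase = " ".join(words[i] for i in group)
            best_prob = group_prob
    if best_prob < min_prob:
        return None  # refrain from responding
    return best_phrase, best_prob
```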

At 340, the server may transmit the identified value corresponding to the input key phrase. For example, the server may determine a value for the key-value pair corresponding to the input key phrase (received at 310) and may transmit the value to the user device that input the input key phrase.

FIG. 4 illustrates an example of an input document 400 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure.

A user may submit the input document 400 to a system or application via a user interface. In some examples, the input document 400 may be input into the system by another user or already stored in a database or similar storage system. The input document 400 may be a form having a fixed format or a non-fixed format. A user may submit an input key phrase querying a value for a key-value pair. The user may be associated with a tenant of a multi-tenant database which has been using the cloud platform for data management. There may be several data stores of data and metadata associated with the tenant which may be used to train a machine learning model. The trained machine learning model may be used to process the query and generate a response.

The input document 400 may include multiple key-value pairs. For example, the input document 400 may include a key “bill to” and a corresponding value “John Smith 2 Court Square New York, N.Y. 12210.” As another example, the input document 400 includes a key 404 (“invoice #”) and a corresponding value 406 (“US-001”). As depicted with reference to FIG. 4, the input document 400 may include a value without any key associated with it. For instance, the value 402 includes “East Repair Inc. 1912 Harvest Lane New York, N.Y. 12210.” However, the input document 400 does not include a key corresponding to the value 402.

Once the user submits the input key phrase, the user interface receiving the input key phrase may send the input key phrase to a database server or some similar computing device or architecture running a machine learning model component. In some examples, the user interface may send a natural language query to a database server with a machine learning model component. For example, the natural language query may be processed by the database server (e.g., device 210 described with reference to FIG. 2) and the database server may identify the input key phrase which may correspond to the natural language query. In some examples, the server may extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document 400. For example, the server may extract the words or phrases “bill,” “to,” “ship,” “to,” “invoice #,” “US-001” and so on. The server may generate an array of words corresponding to the input document 400 and an array of their corresponding locations. For example, a reference location may be established for the input document 400 (e.g., the lower left-hand corner), and the locations (e.g., with respect to an X, Y coordinate system or any other coordinate system) of the words “bill,” “to,” “ship,” “to,” etc. may be determined with respect to the reference location. The server may then determine the value corresponding to the input key phrase based on one or more probability values computed by a machine learned model, as described in more detail with reference to FIG. 3.
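As one illustration of expressing word locations relative to a reference location at the lower left-hand corner, the sketch below converts a bounding box from an image-style coordinate system. It assumes, purely for the example, that boxes arrive as (x0, y_top, x1, y_bottom) with the origin at the top-left corner and y increasing downward; the description does not state how the optical character recognition model reports coordinates, and the function name is hypothetical.

```python
def to_lower_left_origin(box, page_height):
    """Convert a top-left-origin box to one measured from the lower-left corner.

    Assumption: box is (x0, y_top, x1, y_bottom) with y increasing downward.
    The result is (x0, y0, x1, y1) with y increasing upward.
    """
    x0, y_top, x1, y_bottom = box
    return (x0, page_height - y_bottom, x1, page_height - y_top)
```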

In the example of FIG. 4, the user may submit an input key phrase as “invoice #.” In this example, the user is querying the value in the form associated with the field “invoice #.” Using the optical character recognition model and the machine learned model, as described herein, the server may determine that the value 406 “US-001” is the corresponding value to the input key phrase. In some examples, the input key phrase may not match any key in the input document 400. For instance, instead of inputting “invoice #” the user may input “invoice number.” The server may determine that the key 404 “invoice #” is the closest match to the phrase “invoice number.” The technique for matching the input key phrase to a key included in the input document 400 is described in further detail with reference to FIG. 3. Upon matching, the server may return the value 406 “US-001” as the value of the input key phrase “invoice number.”

In some examples, the server may determine that the input key phrase is associated with an empty value field. The server may identify a dummy value corresponding to the input key phrase based on determining that the input key phrase is associated with the empty value field. In such instances, the server may return the dummy value in response to an input key phrase that is associated with an empty value.

FIG. 5 illustrates an example of a process flow 500 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The process flow diagram 500 includes a user device 505 and a server 510. The user device 505 may be an example of the user device 205 as described with respect to FIG. 2, and the server 510 may be an example of the server 210 as described with respect to FIG. 2. Although one user device 505 is depicted in the example of FIG. 5, it may be understood that the process flow 500 may include multiple user devices 505. The server may represent a set of computing components, data storage components, and the like, as described herein, and the processing may occur across one or more devices. In some examples, the server 510 may support a multi-tenant database system as described herein. The process illustrated in FIG. 5 may be performed for various tenants of the multi-tenant system.

In some examples, the operations illustrated in the process flow 500 may be performed by hardware (e.g., including circuitry, processing blocks, logic components, and other components), code (e.g., software or firmware) executed by a processor, or any combination thereof. Alternative examples of the following may be implemented, where some steps are performed in a different order than described or are not performed at all. In some cases, steps may include additional features not mentioned below, or further steps may be added.

At 515, the server 510 may receive an input document including a set of input text fields. In some examples, the input document may include a fixed form, a non-fixed form, or both. The server 510 may receive the input document through an upload or submission process via a user interface.

At 520, the server 510 may receive an input key phrase querying a value for a key-value pair that corresponds to one or more of the set of input text fields. The server 510 may receive the input key phrase via the user interface. The user interface used to submit the input document may be the same as or different from the user interface used to submit the input key phrase.

At 525, the server 510 may extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. At 530, the server 510 may compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. In some examples, the server 510 may input the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute the probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. In some cases, the server 510 may input the input key phrase into the machine learned model. The server 510 may then identify a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase.

At 535, the server 510 may rank the set of probabilities based on a value of each probability in the set of probabilities. At 540, the server 510 may identify the value for the key-value pair corresponding to the input key phrase based on inputting the extracted set of character strings and the set of two-dimensional locations into the machine learned model. In some examples, identifying the value for the key-value pair corresponding to the input key phrase may be based on ranking the set of probabilities. Additionally or alternatively, the server 510 may group one or more character strings into a value phrase based on an output of the machine learned model and the set of two-dimensional locations of the set of character strings. In some examples, identifying the value for the key-value pair is based on grouping the one or more character strings. At 545, the server 510 may transmit the identified value corresponding to the input key phrase to the user device 505.

FIG. 6 shows a block diagram 600 of a device 605 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The device 605 may include an input module 610, an output module 615, and a processing component 620. The device 605 may also include a processor. Each of these components may be in communication with one another (e.g., via one or more buses).

The input module 610 may manage input signals for the device 605. For example, the input module 610 may identify input signals based on an interaction with a modem, a keyboard, a mouse, a touchscreen, or a similar device. These input signals may be associated with user input or processing at other components or devices. In some cases, the input module 610 may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system to handle input signals. The input module 610 may send aspects of these input signals to other components of the device 605 for processing. For example, the input module 610 may transmit input signals to the processing component 620 to support processing forms using artificial intelligence models. In some cases, the input module 610 may be a component of an I/O controller 810 as described with reference to FIG. 8.

The output module 615 may manage output signals for the device 605. For example, the output module 615 may receive signals from other components of the device 605, such as the processing component 620, and may transmit these signals to other components or devices. In some examples, the output module 615 may transmit output signals for display in a user interface, for storage in a database or data store, for further processing at a server or server cluster, or for any other processes at any number of devices or systems. In some cases, the output module 615 may be a component of an I/O controller 810 as described with reference to FIG. 8.

For example, the processing component 620 may include a document input component 625, a key phrase component 630, an extraction component 635, a probability component 640, a value identification component 645, a value transmission component 650, or any combination thereof. In some examples, the processing component 620, or various components thereof, may be configured to perform various operations (e.g., receiving, monitoring, transmitting) using or otherwise in cooperation with the input module 610, the output module 615, or both. For example, the processing component 620 may receive information from the input module 610, send information to the output module 615, or be integrated in combination with the input module 610, the output module 615, or both to receive information, transmit information, or perform various other operations as described herein.

The processing component 620 may support form processing at a server in accordance with examples as disclosed herein. The document input component 625 may be configured as or otherwise support a means for receiving an input document including a plurality of input text fields. The key phrase component 630 may be configured as or otherwise support a means for receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields. The extraction component 635 may be configured as or otherwise support a means for extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The probability component 640 may be configured as or otherwise support a means for inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. The value identification component 645 may be configured as or otherwise support a means for identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting. The value transmission component 650 may be configured as or otherwise support a means for transmitting the identified value corresponding to the input key phrase.

FIG. 7 shows a block diagram 700 of a processing component 720 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The processing component 720 may be an example of aspects of a processing component or a processing component 620, or both, as described herein. The processing component 720, or various components thereof, may be an example of means for performing various aspects of processing forms using artificial intelligence models as described herein. For example, the processing component 720 may include a document input component 725, a key phrase component 730, an extraction component 735, a probability component 740, a value identification component 745, a value transmission component 750, a grouping component 755, a matching component 760, a feature representation component 765, a training component 770, or any combination thereof. Each of these components may communicate, directly or indirectly, with one another (e.g., via one or more buses).

The processing component 720 may support form processing at a server in accordance with examples as disclosed herein. The document input component 725 may be configured as or otherwise support a means for receiving an input document including a plurality of input text fields. The key phrase component 730 may be configured as or otherwise support a means for receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields. The extraction component 735 may be configured as or otherwise support a means for extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The probability component 740 may be configured as or otherwise support a means for inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. The value identification component 745 may be configured as or otherwise support a means for identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting. The value transmission component 750 may be configured as or otherwise support a means for transmitting the identified value corresponding to the input key phrase.

In some examples, the key phrase component 730 may be configured as or otherwise support a means for inputting the input key phrase into the machine learned model. In some examples, the probability component 740 may be configured as or otherwise support a means for identifying a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase.

In some examples, the probability component 740 may be configured as or otherwise support a means for ranking the set of probabilities based at least in part on a value of each probability in the set of probabilities, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on ranking the set of probabilities.

In some examples, the grouping component 755 may be configured as or otherwise support a means for grouping one or more character strings into a value phrase based at least in part on an output of the machine learned model and the set of two-dimensional locations of the set of character strings, wherein identifying the value for the key-value pair is based at least in part on grouping the one or more character strings.

In some examples, the matching component 760 may be configured as or otherwise support a means for determining that the input key phrase does not match a key that corresponds to one or more of the plurality of input text fields. In some examples, the value identification component 745 may be configured as or otherwise support a means for identifying a dummy key corresponding to the input key phrase based at least in part on determining that the input key phrase does not match the key, wherein identifying the value for the key-value pair is based at least in part on identifying the dummy key.

In some examples, the value identification component 745 may be configured as or otherwise support a means for determining that the input key phrase is associated with an empty value field. In some examples, the value identification component 745 may be configured as or otherwise support a means for identifying a dummy value corresponding to the input key phrase based at least in part on determining that the input key phrase is associated with the empty value field, wherein the identified value corresponding to the input key phrase comprises the dummy value.

In some examples, the feature representation component 765 may be configured as or otherwise support a means for generating a first set of feature representations for a set of keywords included in the plurality of input text fields and a second set of feature representations for the extracted set of character strings. In some examples, the feature representation component 765 may be configured as or otherwise support a means for generating a unified feature representation for the input key phrase based at least in part on the first set of feature representations and the second set of feature representations, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on the unified feature representation.

In some examples, the feature representation component 765 may be configured as or otherwise support a means for applying a dot product between the unified feature representation for the input key phrase and each feature representation of the second set of feature representations, wherein the probability that the character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase is computed based at least in part on applying the dot product.

In some examples, the training component 770 may be configured as or otherwise support a means for training the machine learned model based at least in part on inputting a plurality of input file formats into the machine learned model. In some examples, the input document comprises a fixed form, a non-fixed form, or both. In some examples, the machine learned model comprises a transformer-based machine learned model.

FIG. 8 shows a diagram of a system 800 including a device 805 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The device 805 may be an example of or include the components of a device 605 as described herein. The device 805 may include components for bi-directional data communications including components for transmitting and receiving communications, such as a processing component 820, an I/O controller 810, a database controller 815, a memory 825, a processor 830, and a database 835. These components may be in electronic communication or otherwise coupled (e.g., operatively, communicatively, functionally, electronically, electrically) via one or more buses (e.g., a bus 840).

The I/O controller 810 may manage input signals 845 and output signals 850 for the device 805. The I/O controller 810 may also manage peripherals not integrated into the device 805. In some cases, the I/O controller 810 may represent a physical connection or port to an external peripheral. In some cases, the I/O controller 810 may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system. In other cases, the I/O controller 810 may represent or interact with a modem, a keyboard, a mouse, a touchscreen, or a similar device. In some cases, the I/O controller 810 may be implemented as part of a processor 830. In some examples, a user may interact with the device 805 via the I/O controller 810 or via hardware components controlled by the I/O controller 810.

The database controller 815 may manage data storage and processing in a database 835. In some cases, a user may interact with the database controller 815. In other cases, the database controller 815 may operate automatically without user interaction. The database 835 may be an example of a single database, a distributed database, multiple distributed databases, a data store, a data lake, or an emergency backup database.

Memory 825 may include random-access memory (RAM) and read-only memory (ROM). The memory 825 may store computer-readable, computer-executable software including instructions that, when executed, cause the processor 830 to perform various functions described herein. In some cases, the memory 825 may contain, among other things, a basic input/output system (BIOS) which may control basic hardware or software operation such as the interaction with peripheral components or devices.

The processor 830 may include an intelligent hardware device (e.g., a general-purpose processor, a DSP, a CPU, a microcontroller, an ASIC, an FPGA, a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, the processor 830 may be configured to operate a memory array using a memory controller. In other cases, a memory controller may be integrated into the processor 830. The processor 830 may be configured to execute computer-readable instructions stored in a memory 825 to perform various functions (e.g., functions or tasks supporting processing forms using artificial intelligence models).

The processing component 820 may support form processing at a server in accordance with examples as disclosed herein. For example, the processing component 820 may be configured as or otherwise support a means for receiving an input document including a plurality of input text fields. The processing component 820 may be configured as or otherwise support a means for receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields. The processing component 820 may be configured as or otherwise support a means for extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The processing component 820 may be configured as or otherwise support a means for inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. The processing component 820 may be configured as or otherwise support a means for identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting. The processing component 820 may be configured as or otherwise support a means for transmitting the identified value corresponding to the input key phrase.

By including or configuring the processing component 820 in accordance with examples as described herein, the device 805 may support techniques for handling different types of forms, dummy values, and keys without values, and may provide an improved user experience related to processing documents without a predefined template.

FIG. 9 shows a flowchart illustrating a method 900 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The operations of the method 900 may be implemented by an application server or its components as described herein. For example, the operations of the method 900 may be performed by an application server as described with reference to FIGS. 1 through 8. In some examples, an application server may execute a set of instructions to control the functional elements of the application server to perform the described functions. Additionally or alternatively, the application server may perform aspects of the described functions using special-purpose hardware.

At 905, the method may include receiving an input document including a plurality of input text fields. The operations of 905 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 905 may be performed by a document input component 725 as described with reference to FIG. 7.

At 910, the method may include receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields. The operations of 910 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 910 may be performed by a key phrase component 730 as described with reference to FIG. 7.

At 915, the method may include extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The operations of 915 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 915 may be performed by an extraction component 735 as described with reference to FIG. 7.

At 920, the method may include inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. The operations of 920 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 920 may be performed by a probability component 740 as described with reference to FIG. 7.

At 925, the method may include identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting. The operations of 925 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 925 may be performed by a value identification component 745 as described with reference to FIG. 7.

At 930, the method may include transmitting the identified value corresponding to the input key phrase. The operations of 930 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 930 may be performed by a value transmission component 750 as described with reference to FIG. 7.
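The flow of the method 900 may be sketched in Python as follows. The names Token, process_form, toy_ocr, and toy_model are hypothetical. In particular, toy_ocr and toy_model are simple stand-ins for an optical character recognition model and a trained machine learned model, neither of which is specified at this level of detail in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Token:
    text: str  # extracted character string
    x: float   # horizontal location on the layout
    y: float   # vertical location on the layout


def process_form(document: str, key_phrase: str,
                 ocr: Callable[[str], List[Token]],
                 model: Callable[[str, List[Token]], List[float]]) -> str:
    tokens = ocr(document)              # 915: strings and 2D locations
    probs = model(key_phrase, tokens)   # 920: one probability per string
    best = max(range(len(tokens)), key=probs.__getitem__)  # 925: pick a value
    return tokens[best].text            # 930: value handed back for transmission


def toy_ocr(document: str) -> List[Token]:
    # Stand-in for OCR: one token per word, laid out on a single row.
    return [Token(word, float(i), 0.0) for i, word in enumerate(document.split())]


def toy_model(key_phrase: str, tokens: List[Token]) -> List[float]:
    # Stand-in for the trained model: favors the token right after the key.
    probs = [0.1] * len(tokens)
    for i, tok in enumerate(tokens[:-1]):
        if tok.text.rstrip(":").lower() == key_phrase.lower():
            probs[i + 1] = 0.9
    return probs


value = process_form("Name: Alice Date: 2020", "Name", toy_ocr, toy_model)
```

Passing the OCR step and the model as parameters keeps the sketch independent of any particular OCR engine or model architecture.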

FIG. 10 shows a flowchart illustrating a method 1000 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The operations of the method 1000 may be implemented by an application server or its components as described herein. For example, the operations of the method 1000 may be performed by an application server as described with reference to FIGS. 1 through 8. In some examples, an application server may execute a set of instructions to control the functional elements of the application server to perform the described functions. Additionally or alternatively, the application server may perform aspects of the described functions using special-purpose hardware.

At 1005, the method may include receiving an input document including a plurality of input text fields. The operations of 1005 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1005 may be performed by a document input component 725 as described with reference to FIG. 7.

At 1010, the method may include receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields. The operations of 1010 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1010 may be performed by a key phrase component 730 as described with reference to FIG. 7.

At 1015, the method may include extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The operations of 1015 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1015 may be performed by an extraction component 735 as described with reference to FIG. 7.

At 1020, the method may include inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. The operations of 1020 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1020 may be performed by a probability component 740 as described with reference to FIG. 7.

At 1025, the method may include inputting the input key phrase into the machine learned model. The operations of 1025 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1025 may be performed by a key phrase component 730 as described with reference to FIG. 7.

At 1030, the method may include identifying a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase. The operations of 1030 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1030 may be performed by a probability component 740 as described with reference to FIG. 7.

At 1035, the method may include ranking the set of probabilities based at least in part on a value of each probability in the set of probabilities, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on ranking the set of probabilities. The operations of 1035 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1035 may be performed by a probability component 740 as described with reference to FIG. 7.

At 1040, the method may include identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting. The operations of 1040 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1040 may be performed by a value identification component 745 as described with reference to FIG. 7.

At 1045, the method may include transmitting the identified value corresponding to the input key phrase. The operations of 1045 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1045 may be performed by a value transmission component 750 as described with reference to FIG. 7.
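The ranking at 1035 and the identification at 1040 may be illustrated with a short Python sketch. The function name rank_candidates, the sample strings, and the sample probabilities are hypothetical, and selecting the top-ranked string is one possible way to identify the value rather than a requirement of the method.

```python
def rank_candidates(strings, probabilities):
    """Pair each character string with its probability, highest first."""
    pairs = zip(strings, probabilities)
    # sorted() is stable, so strings with equal probability keep their order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical model output for the key phrase "Date".
strings = ["Invoice", "2021-03-04", "Total"]
probabilities = [0.05, 0.90, 0.05]

ranked = rank_candidates(strings, probabilities)
identified_value = ranked[0][0]  # top-ranked string is taken as the value
```

Keeping the full ranked list, rather than only the maximum, allows lower-ranked candidates to be inspected or returned if an implementation chooses to do so.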

FIG. 11 shows a flowchart illustrating a method 1100 that supports processing forms using artificial intelligence models in accordance with aspects of the present disclosure. The operations of the method 1100 may be implemented by an application server or its components as described herein. For example, the operations of the method 1100 may be performed by an application server as described with reference to FIGS. 1 through 8. In some examples, an application server may execute a set of instructions to control the functional elements of the application server to perform the described functions. Additionally or alternatively, the application server may perform aspects of the described functions using special-purpose hardware.

At 1105, the method may include receiving an input document including a plurality of input text fields. The operations of 1105 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1105 may be performed by a document input component 725 as described with reference to FIG. 7.

At 1110, the method may include receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields. The operations of 1110 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1110 may be performed by a key phrase component 730 as described with reference to FIG. 7.

At 1115, the method may include extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document. The operations of 1115 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1115 may be performed by an extraction component 735 as described with reference to FIG. 7.

At 1120, the method may include inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase. The operations of 1120 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1120 may be performed by a probability component 740 as described with reference to FIG. 7.

At 1125, the method may include determining that the input key phrase does not match a key that corresponds to one or more of the plurality of input text fields. The operations of 1125 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1125 may be performed by a matching component 760 as described with reference to FIG. 7.

At 1130, the method may include identifying a dummy key corresponding to the input key phrase based at least in part on determining that the input key phrase does not match the key, wherein identifying the value for the key-value pair is based at least in part on identifying the dummy key. The operations of 1130 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1130 may be performed by a value identification component 745 as described with reference to FIG. 7.

At 1135, the method may include identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting. The operations of 1135 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1135 may be performed by a value identification component 745 as described with reference to FIG. 7.

At 1140, the method may include transmitting the identified value corresponding to the input key phrase. The operations of 1140 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1140 may be performed by a value transmission component 750 as described with reference to FIG. 7.

A method for form processing at a server is described. The method may include receiving an input document including a plurality of input text fields, receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields, extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document, inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase, identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting, and transmitting the identified value corresponding to the input key phrase.

An apparatus for form processing at a server is described. The apparatus may include a processor, memory coupled with the processor, and instructions stored in the memory. The instructions may be executable by the processor to cause the apparatus to receive an input document including a plurality of input text fields, receive an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields, extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document, input the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase, identify the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting, and transmit the identified value corresponding to the input key phrase.

Another apparatus for form processing at a server is described. The apparatus may include means for receiving an input document including a plurality of input text fields, means for receiving an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields, means for extracting, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document, means for inputting the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase, means for identifying the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting, and means for transmitting the identified value corresponding to the input key phrase.

A non-transitory computer-readable medium storing code for form processing at a server is described. The code may include instructions executable by a processor to receive an input document including a plurality of input text fields, receive an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields, extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document, input the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase, identify the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting, and transmit the identified value corresponding to the input key phrase.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for inputting the input key phrase into the machine learned model and identifying a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for ranking the set of probabilities based at least in part on a value of each probability in the set of probabilities, wherein identifying the value for the key-value pair corresponding to the input key phrase may be based at least in part on ranking the set of probabilities.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for grouping one or more character strings into a value phrase based at least in part on an output of the machine learned model and the set of two-dimensional locations of the set of character strings, wherein identifying the value for the key-value pair may be based at least in part on grouping the one or more character strings.
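One way to picture the grouping of character strings into a value phrase is the following Python sketch. The function name group_value_phrase, the probability threshold, the line tolerance, and the rule of joining strings on the same row from left to right are all assumptions for illustration. The disclosure states only that the grouping is based on the model output and the two-dimensional locations.

```python
from typing import List, Tuple

Token = Tuple[str, float, float]  # (text, x, y) on the document layout


def group_value_phrase(tokens: List[Token], probs: List[float],
                       threshold: float = 0.5,
                       line_tolerance: float = 5.0) -> str:
    """Join likely value strings that sit on the same row as the best one."""
    candidates = [(tok, p) for tok, p in zip(tokens, probs) if p >= threshold]
    if not candidates:
        return ""
    # The highest-probability string anchors the row of the value phrase.
    anchor = max(candidates, key=lambda pair: pair[1])[0]
    same_line = [tok for tok, _ in candidates
                 if abs(tok[2] - anchor[2]) <= line_tolerance]
    same_line.sort(key=lambda tok: tok[1])  # left-to-right reading order
    return " ".join(tok[0] for tok in same_line)


# Hypothetical OCR output and model probabilities for the key "Address".
tokens = [("Address:", 10, 100), ("Main", 160, 101), ("12", 120, 100),
          ("Street", 210, 99), ("Total", 10, 300)]
probs = [0.02, 0.85, 0.90, 0.80, 0.60]
phrase = group_value_phrase(tokens, probs)
```

In this sketch the string "Total" passes the probability threshold but is excluded because it lies on a different row, which shows why the locations are used together with the model output.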

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for determining that the input key phrase does not match a key that corresponds to one or more of the plurality of input text fields and identifying a dummy key corresponding to the input key phrase based at least in part on determining that the input key phrase does not match the key, wherein identifying the value for the key-value pair may be based at least in part on identifying the dummy key.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for determining that the input key phrase may be associated with an empty value field and identifying a dummy value corresponding to the input key phrase based at least in part on determining that the input key phrase may be associated with the empty value field, wherein the identified value corresponding to the input key phrase comprises the dummy value.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for generating a first set of feature representations for a set of keywords included in the plurality of input text fields and a second set of feature representations for the extracted set of character strings and generating a unified feature representation for the input key phrase based at least in part on the first set of feature representations and the second set of feature representations, wherein identifying the value for the key-value pair corresponding to the input key phrase may be based at least in part on the unified feature representation.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for applying a dot product between the unified feature representation for the input key phrase and each feature representation of the second set of feature representations, wherein the probability that the character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase may be computed based at least in part on applying the dot product.

Some examples of the method, apparatuses, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for training the machine learned model based at least in part on inputting a plurality of input file formats into the machine learned model.

In some examples of the method, apparatuses, and non-transitory computer-readable medium described herein, the input document comprises a fixed form, a non-fixed form, or both. In some examples of the method, apparatuses, and non-transitory computer-readable medium described herein, the machine learned model comprises a transformer-based machine learned model.

It should be noted that the methods described above describe possible implementations, and that the operations and the steps may be rearranged or otherwise modified and that other implementations are possible. Furthermore, aspects from two or more of the methods may be combined.

The description set forth herein, in connection with the appended drawings, describes example configurations and does not represent all the examples that may be implemented or that are within the scope of the claims. The term “exemplary” used herein means “serving as an example, instance, or illustration,” and not “preferred” or “advantageous over other examples.” The detailed description includes specific details for the purpose of providing an understanding of the described techniques. These techniques, however, may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described examples.

In the appended figures, similar components or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label by a dash and a second label that distinguishes among the similar components. If just the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.

Information and signals described herein may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

The various illustrative blocks and modules described in connection with the disclosure herein may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).

The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Other examples and implementations are within the scope of the disclosure and appended claims. For example, due to the nature of software, functions described herein can be implemented using software executed by a processor, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, “or” as used in a list of items (for example, a list of items prefaced by a phrase such as “at least one of” or “one or more of”) indicates an inclusive list such that, for example, a list of at least one of A, B, or C means A or B or C or AB or AC or BC or ABC (i.e., A and B and C). Also, as used herein, the phrase “based on” shall not be construed as a reference to a closed set of conditions. For example, an exemplary step that is described as “based on condition A” may be based on both a condition A and a condition B without departing from the scope of the present disclosure. In other words, as used herein, the phrase “based on” shall be construed in the same manner as the phrase “based at least in part on.”

Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A non-transitory storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, non-transitory computer-readable media can comprise RAM, ROM, electrically erasable programmable ROM (EEPROM), compact disk (CD) ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other non-transitory medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.

The description herein is provided to enable a person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein, but is to be accorded the broadest scope consistent with the principles and novel features disclosed herein.

What is claimed is:
 1. A method for form processing, comprising:receiving an input document including a plurality of input text fields;receiving an input key phrase querying a value for a key-value pair thatcorresponds to one or more of the plurality of input text fields;extracting, using an optical character recognition model, a set ofcharacter strings and a set of two-dimensional locations of the set ofcharacter strings on a layout of the input document; inputting theextracted set of character strings and the set of two-dimensionallocations into a machine learned model that is trained to compute aprobability that a character string of the set of character stringscorresponds to the value for the key-value pair corresponding to theinput key phrase; identifying the value for the key-value paircorresponding to the input key phrase based at least in part on theinputting; and transmitting the identified value corresponding to theinput key phrase.
 2. The method of claim 1, further comprising: inputting the input key phrase into the machine learned model; and identifying a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase.
 3. The method of claim 2, further comprising: ranking the set of probabilities based at least in part on a value of each probability in the set of probabilities, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on ranking the set of probabilities.
 4. The method of claim 1, further comprising: grouping one or more character strings into a value phrase based at least in part on an output of the machine learned model and the set of two-dimensional locations of the set of character strings, wherein identifying the value for the key-value pair is based at least in part on grouping the one or more character strings.
 5. The method of claim 1, further comprising: determining that the input key phrase does not match a key that corresponds to one or more of the plurality of input text fields; and identifying a dummy key corresponding to the input key phrase based at least in part on determining that the input key phrase does not match the key, wherein identifying the value for the key-value pair is based at least in part on identifying the dummy key.
 6. The method of claim 1, further comprising: determining that the input key phrase is associated with an empty value field; and identifying a dummy value corresponding to the input key phrase based at least in part on determining that the input key phrase is associated with the empty value field, wherein the identified value corresponding to the input key phrase comprises the dummy value.
 7. The method of claim 1, further comprising: generating a first set of feature representations for a set of keywords included in the plurality of input text fields and a second set of feature representations for the extracted set of character strings; and generating a unified feature representation for the input key phrase based at least in part on the first set of feature representations and the second set of feature representations, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on the unified feature representation.
 8. The method of claim 7, further comprising: applying a dot product between the unified feature representation for the input key phrase and each feature representation of the second set of feature representations, wherein the probability that the character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase is computed based at least in part on applying the dot product.
 9. The method of claim 1, further comprising: training the machine learned model based at least in part on inputting a plurality of input file formats into the machine learned model.
 10. The method of claim 1, wherein the input document comprises a fixed form, a non-fixed form, or both.
 11. The method of claim 1, wherein the machine learned model comprises a transformer-based machine learned model.
 12. An apparatus for form processing, comprising: a processor; memory coupled with the processor; and instructions stored in the memory and executable by the processor to cause the apparatus to: receive an input document including a plurality of input text fields; receive an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields; extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document; input the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase; identify the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting; and transmit the identified value corresponding to the input key phrase.
 13. The apparatus of claim 12, wherein the instructions are further executable by the processor to cause the apparatus to: input the input key phrase into the machine learned model; and identify a set of probabilities for the set of character strings being the value for the key-value pair corresponding to the input key phrase.
 14. The apparatus of claim 13, wherein the instructions are further executable by the processor to cause the apparatus to: rank the set of probabilities based at least in part on a value of each probability in the set of probabilities, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on ranking the set of probabilities.
 15. The apparatus of claim 12, wherein the instructions are further executable by the processor to cause the apparatus to: group one or more character strings into a value phrase based at least in part on an output of the machine learned model and the set of two-dimensional locations of the set of character strings, wherein identifying the value for the key-value pair is based at least in part on grouping the one or more character strings.
 16. The apparatus of claim 12, wherein the instructions are further executable by the processor to cause the apparatus to: determine that the input key phrase does not match a key that corresponds to one or more of the plurality of input text fields; and identify a dummy key corresponding to the input key phrase based at least in part on determining that the input key phrase does not match the key, wherein identifying the value for the key-value pair is based at least in part on identifying the dummy key.
 17. The apparatus of claim 12, wherein the instructions are further executable by the processor to cause the apparatus to: determine that the input key phrase is associated with an empty value field; and identify a dummy value corresponding to the input key phrase based at least in part on determining that the input key phrase is associated with the empty value field, wherein the identified value corresponding to the input key phrase comprises the dummy value.
 18. The apparatus of claim 12, wherein the instructions are further executable by the processor to cause the apparatus to: generate a first set of feature representations for a set of keywords included in the plurality of input text fields and a second set of feature representations for the extracted set of character strings; and generate a unified feature representation for the input key phrase based at least in part on the first set of feature representations and the second set of feature representations, wherein identifying the value for the key-value pair corresponding to the input key phrase is based at least in part on the unified feature representation.
 19. The apparatus of claim 18, wherein the instructions are further executable by the processor to cause the apparatus to: apply a dot product between the unified feature representation for the input key phrase and each feature representation of the second set of feature representations, wherein the probability that the character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase is computed based at least in part on applying the dot product.
 20. A non-transitory computer-readable medium storing code for form processing, the code comprising instructions executable by a processor to: receive an input document including a plurality of input text fields; receive an input key phrase querying a value for a key-value pair that corresponds to one or more of the plurality of input text fields; extract, using an optical character recognition model, a set of character strings and a set of two-dimensional locations of the set of character strings on a layout of the input document; input the extracted set of character strings and the set of two-dimensional locations into a machine learned model that is trained to compute a probability that a character string of the set of character strings corresponds to the value for the key-value pair corresponding to the input key phrase; identify the value for the key-value pair corresponding to the input key phrase based at least in part on the inputting; and transmit the identified value corresponding to the input key phrase.
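
By way of illustration only, and not limitation, the scoring and ranking operations recited in claims 2, 3, and 8 may be sketched as follows. The sketch assumes that the unified feature representation for the input key phrase and the feature representations for the extracted character strings are already available as plain numeric vectors. The claims do not specify how dot products are converted into probabilities, so the softmax normalization used here is an assumption, and the names `dot`, `softmax`, and `rank_candidates` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into probabilities (assumed normalization)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(
    key_vec: Sequence[float],
    strings: Sequence[str],
    string_vecs: Sequence[Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each character string against the key phrase representation
    and return (string, probability) pairs, highest probability first."""
    if len(strings) != len(string_vecs):
        raise ValueError("one feature vector is required per character string")
    if not strings:
        return []
    scores = [dot(key_vec, vec) for vec in string_vecs]
    probs = softmax(scores)
    return sorted(zip(strings, probs), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy two-dimensional vectors chosen purely for illustration.
    ranked = rank_candidates(
        [1.0, 0.0],
        ["Invoice", "2021-03-04", "Total"],
        [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]],
    )
    for text, prob in ranked:
        print(f"{text}: {prob:.3f}")
```

In this toy example the string "2021-03-04" has the largest dot product with the key phrase vector and is therefore ranked first; the first entry of the ranked list would then be taken as the identified value.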